feat: add job lifecycle tracking and retry logic for stuck jobs #49
Open
laurenchurch wants to merge 3 commits into macstadium:main from
Conversation
- Track jobs immediately upon acquisition to monitor full lifecycle
- Implement max 3 provisioning retries with 15-second intervals
- Add background monitoring for jobs stuck over 5 minutes
- Enhanced logging at each job state transition
- Add 8 unit tests for job tracking and concurrent access safety

Fixes issues where jobs remain stuck indefinitely when:
- Provisioning fails and jobs stay acquired without retry
- Jobs are acquired but never assigned by GitHub Actions
- Container crashes leave jobs in limbo state

Tests: make test passes with 23 total specs
Lint: make lint passes with no issues
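The retry behavior described in this commit message (at most 3 provisioning attempts, 15 seconds apart) could look roughly like the sketch below. It is not the PR's actual code: only `maxProvisioningRetries` and `p.ctx` appear in the diff, while `processor`, `newProcessor`, `retryInterval`, `errProvisionFailed`, and `provisionWithRetry` are hypothetical names used for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// maxProvisioningRetries appears in the PR diff; the value 3 and the
// 15-second interval come from the commit message. Other names are assumed.
const (
	maxProvisioningRetries = 3
	retryInterval          = 15 * time.Second
)

// errProvisionFailed is a placeholder error used for demonstration only.
var errProvisionFailed = errors.New("provisioning failed")

// processor is a hypothetical stand-in for the PR's message processor.
type processor struct {
	ctx      context.Context
	interval time.Duration // wait between failed attempts
}

func newProcessor(interval time.Duration) *processor {
	return &processor{ctx: context.Background(), interval: interval}
}

// provisionWithRetry calls provision up to maxProvisioningRetries times and
// waits p.interval after each failed attempt except the last one.
func (p *processor) provisionWithRetry(provision func(attempt int) error) error {
	var lastErr error
	for attempt := 1; attempt <= maxProvisioningRetries; attempt++ {
		if lastErr = provision(attempt); lastErr == nil {
			return nil
		}
		if attempt == maxProvisioningRetries {
			break // no point in waiting after the final attempt
		}
		select {
		case <-p.ctx.Done():
			return fmt.Errorf("retry aborted: %w", p.ctx.Err())
		case <-time.After(p.interval):
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxProvisioningRetries, lastErr)
}
```

In this sketch a caller would construct the processor with `newProcessor(retryInterval)`. The `select` on `p.ctx.Done()` lets a shutdown interrupt the wait instead of blocking for the full interval.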
…eout Jobs stuck for >5 minutes are now automatically removed from tracking and marked as canceled, allowing VM cleanup to proceed. Added 10-minute timeout per provisioning attempt to prevent indefinite hangs from SSH or network issues.
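A minimal sketch of the stuck-job cleanup described here, assuming a mutex-guarded map keyed by runner request ID. `AcquiredJobInfo` is named in the PR, but its fields, the `jobTracker` type, and every method below are assumptions made for illustration, not the PR's implementation.

```go
package main

import (
	"sync"
	"time"
)

// stuckJobThreshold mirrors the 5-minute limit from the commit message.
const stuckJobThreshold = 5 * time.Minute

// AcquiredJobInfo is named in the PR; the fields shown here are assumptions.
type AcquiredJobInfo struct {
	RunnerRequestID int64
	AcquiredAt      time.Time
}

// jobTracker is a hypothetical tracker keyed by runner request ID.
type jobTracker struct {
	mu   sync.Mutex
	jobs map[int64]AcquiredJobInfo
}

func newJobTracker() *jobTracker {
	return &jobTracker{jobs: make(map[int64]AcquiredJobInfo)}
}

// track records a job at acquisition time; each ID gets its own entry.
func (jt *jobTracker) track(id int64) {
	jt.mu.Lock()
	defer jt.mu.Unlock()
	jt.jobs[id] = AcquiredJobInfo{RunnerRequestID: id, AcquiredAt: time.Now()}
}

// count returns the number of jobs currently tracked.
func (jt *jobTracker) count() int {
	jt.mu.Lock()
	defer jt.mu.Unlock()
	return len(jt.jobs)
}

// removeStuck drops every job tracked for longer than threshold and returns
// the removed IDs (in no particular order) so the caller can cancel them.
func (jt *jobTracker) removeStuck(threshold time.Duration) []int64 {
	jt.mu.Lock()
	defer jt.mu.Unlock()
	var stuck []int64
	for id, info := range jt.jobs {
		if time.Since(info.AcquiredAt) > threshold {
			stuck = append(stuck, id)
			delete(jt.jobs, id) // deleting during range is safe in Go
		}
	}
	return stuck
}
```

A background goroutine driven by a `time.Ticker` could call `removeStuck(stuckJobThreshold)` periodically and mark the returned jobs as canceled.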
fix: implement active cleanup for stuck jobs and add provisioning timeout
Collaborator
Thank you for your contribution. There are several important things that are worth discussing:
As you can see, the second job overwrote the first one instead of creating its own entry.
ispasov
reviewed
Jan 19, 2026
p.logger.Infof("Provisioning runner for job %s (RunnerRequestId: %d), attempt %d/%d", jobId, runnerRequestId, attempt, maxProvisioningRetries)

// Create timeout context for this provisioning attempt
provisionCtx, cancel := context.WithTimeout(p.ctx, provisioningTimeout)
Collaborator
This timeout is not only for the provisioning, but also for the job duration.
Provisioning does the following:
- It creates the VM
- It creates the runner
- It runs the job
- It deletes the VM
If the timeout is hit before all of these operations are finished, it is possible that they are not completed.
Imagine the following scenario:
- A job run takes 11 minutes
- Everything is configured properly
- VM delete does not happen as the context is already cancelled
- There is an orphaned VM in the cluster that will not be removed
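One common way to avoid this kind of orphaned VM is to run the deletion under a context that is not derived from the expired one. The sketch below illustrates that idea only; it is not a fix proposed in this review. The `vmClient` interface, `runWithCleanup`, `fakeVM`, and `demoSlowJob` are all hypothetical, and the project's real API may look different.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// vmClient is a hypothetical interface; the project's real API may differ.
type vmClient interface {
	RunJob(ctx context.Context) error
	DeleteVM(ctx context.Context) error
}

// runWithCleanup runs the job under a deadline and then deletes the VM under
// a separate context, so an expired job deadline cannot cancel the deletion.
func runWithCleanup(parent context.Context, c vmClient, jobTimeout, cleanupTimeout time.Duration) (jobErr, cleanupErr error) {
	jobCtx, cancel := context.WithTimeout(parent, jobTimeout)
	defer cancel()
	jobErr = c.RunJob(jobCtx)

	// Fresh context: deliberately not derived from jobCtx.
	cleanupCtx, cancelCleanup := context.WithTimeout(context.Background(), cleanupTimeout)
	defer cancelCleanup()
	cleanupErr = c.DeleteVM(cleanupCtx)
	return jobErr, cleanupErr
}

// fakeVM simulates a job that outlives its deadline (demo only).
type fakeVM struct{ deleted bool }

func (f *fakeVM) RunJob(ctx context.Context) error {
	<-ctx.Done() // the "job" never finishes before the deadline
	return ctx.Err()
}

func (f *fakeVM) DeleteVM(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return err // a cancelled context would leave the VM orphaned
	}
	f.deleted = true
	return nil
}

// demoSlowJob reports whether the job timed out and whether the VM was
// still deleted afterwards.
func demoSlowJob() (timedOut, deleted bool) {
	vm := &fakeVM{}
	jobErr, _ := runWithCleanup(context.Background(), vm, 10*time.Millisecond, time.Second)
	return errors.Is(jobErr, context.DeadlineExceeded), vm.deleted
}
```

With this arrangement the 11-minute job in the scenario above would still be cut off by a 10-minute deadline, but the VM delete would run. Whether the deadline should cover the job run at all is a separate question that the sketch does not settle.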
Summary
Fixes job queuing issues and stale VM cleanup where jobs remain stuck indefinitely when provisioning fails or when acquired jobs never receive assignment messages from GitHub Actions.
Problem
Jobs could get stuck in GitHub's queue in multiple scenarios:
Related to MacStadium ticket SERVICE-203600
Solution
1. Job Lifecycle Tracking
2. Provisioning Retry Logic
3. Stuck Job Monitoring & Cleanup
4. Provisioning Timeout Protection
Changes
Files Modified:
- pkg/github/runners/types.go - Added AcquiredJobInfo struct and tracking fields
- pkg/github/runners/message-processor.go - Core implementation with active cleanup
- pkg/github/runners/message-processor_test.go - Unit tests (10 new tests)

Testing